Minimally Supervised Categorization of Text with Metadata
Document categorization, which aims to assign a topic label to each document,
plays a fundamental role in a wide variety of applications. Despite the success
of existing studies in conventional supervised document classification, they
pay less attention to two real-world problems: (1) the presence of metadata: in
many domains, text is accompanied by various additional information such as
authors and tags. Such metadata serve as compelling topic indicators and should
be leveraged into the categorization framework; (2) label scarcity: labeled
training samples are expensive to obtain in some cases, where categorization
needs to be performed using only a small set of annotated data. In recognition
of these two challenges, we propose MetaCat, a minimally supervised framework
to categorize text with metadata. Specifically, we develop a generative process
describing the relationships between words, documents, labels, and metadata.
Guided by the generative model, we embed text and metadata into the same
semantic space to encode heterogeneous signals. Then, based on the same
generative process, we synthesize training samples to address the bottleneck of
label scarcity. We conduct a thorough evaluation on a wide range of datasets.
Experimental results prove the effectiveness of MetaCat over many competitive
baselines.

Comment: 10 pages; Accepted to SIGIR 2020; Some typos fixed
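One way to read "embed text and metadata into the same semantic space" is to treat metadata entities as extra tokens in each document, so a single embedding trainer sees words, authors, and tags in one vocabulary. The sketch below illustrates that idea only; it is an assumed simplification, not MetaCat's actual generative model, and the `author`/`tag` keys are hypothetical.

```python
def to_joint_sequence(words, metadata):
    """Flatten a document's words and its metadata into one token sequence,
    so a standard embedding trainer (e.g., skip-gram) places words and
    metadata entities in the same semantic space. Metadata tokens are
    prefixed with their type to keep the joint vocabulary unambiguous.
    Illustrative sketch; MetaCat's generative process is more involved."""
    tokens = list(words)
    for meta_type, values in metadata.items():
        tokens.extend(f"{meta_type}:{v}" for v in values)
    return tokens

doc = to_joint_sequence(
    ["deep", "learning", "for", "text"],
    {"author": ["alice"], "tag": ["nlp", "ml"]},
)
# doc == ['deep', 'learning', 'for', 'text', 'author:alice', 'tag:nlp', 'tag:ml']
```

Because metadata tokens co-occur with the document's words, their learned vectors land near topically related words, which is the heterogeneous-signal encoding the abstract describes.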
Contextualized, Metadata-Empowered, Coarse-to-Fine Weakly-Supervised Text Classification
Text classification plays a fundamental role in transforming unstructured text data into structured knowledge. State-of-the-art text classification techniques rely on heavy domain-specific annotations to build massive machine/deep learning models. Although these deep learning models exhibit superior performance, the lack of training data and the expensive human effort of manual annotation are a key bottleneck that prevents them from being adopted in many practical scenarios. To address this bottleneck, our research exploits the data and develops a family of data-driven text classification frameworks with minimal supervision, e.g., class names or a few label-indicative seed words per class. The massive volume of text data and the complexity of natural language pose significant challenges to categorizing a text corpus without human annotations. For instance, user-provided seed words can have multiple interpretations depending on the context, and the user-intended interpretation has to be identified for accurate classification. Moreover, metadata such as author, year, and location is widely available in addition to the text data, and it can serve as a strong, complementary source of supervision. However, leveraging metadata is challenging because (1) metadata is multi-typed, requiring systematic modeling of different types and their combinations, and (2) metadata is noisy: some metadata entities (e.g., authors, venues) are more compelling label indicators than others. In addition, the label set is typically assumed to be fixed in traditional text classification problems, yet in many real-world applications, new classes, especially more fine-grained ones, are introduced as the data volume increases.
The goal of our research is to create general data-driven methods that transform real-world text data into structured categories of human knowledge with minimal human effort. This thesis outlines a family of weakly supervised text classification approaches which, in combination, can automatically categorize a huge text corpus into coarse- and fine-grained classes with just a label hierarchy and a few label-indicative seed words as supervision. Specifically, it first leverages contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of a seed word, resulting in contextualized weak supervision. Then, to leverage metadata, it organizes the text data and metadata together into a text-rich network and adopts network motifs to capture appropriate combinations of metadata. Finally, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into the iterative weak supervision strategy. We have performed extensive experiments on real-world datasets from different domains. The results demonstrate significant advantages of using contextualized weak supervision and leveraging metadata, and superior performance over baselines.
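The contextualized-weak-supervision step above can be pictured as follows: each occurrence of a seed word gets a contextualized vector, occurrences are grouped by sense, and only occurrences matching the user-intended sense count as supervision. The sketch below is a minimal, hypothetical illustration that assumes sense centroids are already given; the thesis instead clusters occurrences automatically.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def resolve_seed_word(occurrence_vecs, sense_centroids, intended_sense):
    """Keep only the occurrences of a seed word whose contextualized
    vector is closest to the centroid of the user-intended sense.
    Hypothetical helper: the actual approach first clusters occurrences,
    then matches clusters against the intended interpretation."""
    kept = []
    for i, vec in enumerate(occurrence_vecs):
        best = max(sense_centroids,
                   key=lambda s: cosine(vec, sense_centroids[s]))
        if best == intended_sense:
            kept.append(i)
    return kept

# Occurrence 0 of the seed word "pitch" looks sports-like; occurrence 1 does not.
kept = resolve_seed_word(
    [[0.9, 0.1], [0.2, 0.8]],
    {"sports": [1.0, 0.0], "music": [0.0, 1.0]},
    "sports",
)
# kept == [0]
```

Filtering out occurrences that carry the wrong sense is what turns raw seed words into contextualized weak supervision.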
Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation
Manually annotating datasets requires domain experts to read through many
documents and carefully label them, which is often expensive. Recently,
pre-trained generative language models (GLMs) have demonstrated exceptional
abilities in generating text, which motivates leveraging them for generative
data augmentation. We improve generative data augmentation by formulating
data generation as a context generation task and using question answering (QA)
datasets for intermediate training. Specifically, we view QA more as a
format than as a task and train GLMs as context generators for a given question
and its respective answer. Then, we cast downstream tasks into the question
answering format and adapt the fine-tuned context generators to the target task
domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which
are then used as synthetic training data for the corresponding tasks. We
perform extensive experiments, case studies, and ablation studies on multiple
sentiment and topic classification datasets and demonstrate substantial
improvements in performance in few-shot and zero-shot settings. Remarkably, on the
SST-2 dataset, intermediate training on the SocialIQA dataset achieves an
improvement of 40% in Macro-F1 score. Through thorough analyses, we observe
that QA datasets that require high-level reasoning abilities (e.g.,
abstractive and common-sense QA datasets) tend to give the best boost in
performance in both few-shot and zero-shot settings.
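The key move in this abstract, casting a downstream task into the QA format, can be sketched as a prompt template: the GLM is conditioned on a question and an answer (the class label) and asked to generate the context. The template string below is an assumed illustration, not the paper's exact wording.

```python
def cast_to_qa(task_question, label, context=None):
    """Cast a downstream classification example into a QA-style prompt.
    During fine-tuning, (prompt, context) pairs teach the GLM to generate
    contexts; at augmentation time, only the prompt is given and the GLM's
    continuation becomes a synthetic training example for the label.
    Hypothetical template, for illustration only."""
    prompt = f"question: {task_question} answer: {label} context:"
    if context is not None:
        # Training pair: the model learns to emit `context` after `prompt`.
        return prompt, context
    # Generation-time prompt: the GLM's continuation is the synthetic text.
    return prompt

prompt = cast_to_qa("What is the sentiment of this review?", "positive")
# prompt == "question: What is the sentiment of this review? answer: positive context:"
```

Because the same question/answer/context format is shared by the QA datasets used for intermediate training and by the recast downstream tasks, the context-generation skill transfers across them.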
LOPS: Learning Order Inspired Pseudo-Label Selection for Weakly Supervised Text Classification
Weakly supervised text classification methods typically train a deep neural
classifier based on pseudo-labels. The quality of pseudo-labels is crucial to
final performance, but they are inevitably noisy due to their heuristic nature,
so selecting the correct ones offers great potential for a performance boost. One
straightforward solution is to select samples based on the softmax probability
scores in the neural classifier corresponding to their pseudo-labels. However,
we show through our experiments that such solutions are ineffective and
unstable due to the erroneously high-confidence predictions from poorly
calibrated models. Recent studies on the memorization effects of deep neural
models suggest that these models first memorize training samples with clean
labels and then those with noisy labels. Inspired by this observation, we
propose a novel pseudo-label selection method, LOPS, that takes the learning
order of samples into consideration. We hypothesize that the learning order
reflects the probability of wrong annotation in terms of ranking, and therefore
select the samples that are learnt earlier. LOPS can be viewed as a strong
performance-boost plug-in to most existing weakly supervised text
classification methods, as confirmed by extensive experiments on four
real-world datasets.
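The learning-order idea can be sketched concretely: record the first epoch at which the classifier's prediction agrees with each sample's pseudo-label, then keep the earliest-learned fraction. This is a simplified sketch under the stated memorization hypothesis; the paper's actual selection criterion is more refined.

```python
def lops_select(epoch_predictions, pseudo_labels, keep_ratio=0.5):
    """Select pseudo-labeled samples by learning order.

    A sample counts as 'learned' at the first epoch where the model's
    prediction matches its pseudo-label; per the memorization effect,
    earlier-learned samples are more likely to be correctly labeled.
    Simplified sketch of the LOPS idea, not the paper's exact algorithm.

    epoch_predictions: list of per-epoch prediction lists.
    Returns indices of the kept samples, sorted ascending.
    """
    n = len(pseudo_labels)
    first_learned = [float("inf")] * n
    for epoch, preds in enumerate(epoch_predictions):
        for i in range(n):
            if first_learned[i] == float("inf") and preds[i] == pseudo_labels[i]:
                first_learned[i] = epoch
    # Rank samples by how early they were learned; keep the earliest fraction.
    order = sorted(range(n), key=lambda i: first_learned[i])
    k = max(1, int(n * keep_ratio))
    return sorted(order[:k])

kept = lops_select(
    epoch_predictions=[[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 0, 1]],
    pseudo_labels=[0, 1, 0, 1],
    keep_ratio=0.5,
)
# kept == [0, 3]  (samples learned in epoch 0)
```

Unlike selection by softmax confidence, this ranking does not depend on the calibration of the model's probability scores, which is why the abstract argues it is more stable.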